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Abstract 

The Laplace approximation is an old, but frequently used method to approximate 
integrals for Bayesian calculations. In this paper we develop an extension of the Laplace 
approximation, by applying it iteratively to the residual, i.e., the difference between 
the current approximation and the true function. The final approximation is thus a 
linear combination of multivariate normal densities, where the coefficients are chosen 
to achieve a good fit to the target distribution. We illustrate on real and artificial 
examples that the proposed procedure is a computationally efficient alternative to 
current approaches for approximation of multivariate probability densities. 
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1 Introduction 



Suppose you are given a positive integrable function 7r{x) on M^, with unknown normalization 
constant Z = J 7r{x)dx and it is desired to approximate the probabihty distribution 'k{x)/Z. 
In the context of Bayesian statistics this could for example be the posterior distribution. The 
idea of the standard Laplace approximation is to maximize log(7r(a;)), resulting in the mode 
X {i.e. a point with gradient 0) and to approximate log(7r(a;)) by a second order Taylor 
approximation in x (assuming the necessary derivatives of log(7r(a;)) exist). This leads to 
an approximation of the form 

Tx{x) ^ exp(log(7r(5;)) + l/2(x — x)' H{x){x — a;)), 

where H{x) = 9 log 7r{x) /dxidxj is the Hesse matrix of log(7r(a;)) evaluated at x. Essentially, 
it{x) is approximated by the kernel of a multivariate normal distribution with mean x and 
covariance matrix S = —H[x)^^. The normalization constant of this approximation is 
(27r)^''^|S|^/^7r(a;), which itself approximates Z. At first sight it might appear simplistic 
to approximate any distribution by a normal distribution, but in the context of Bayesian 
statistics this approach is justified by the asymptotic normality of posterior distributions, and 
often works also for moderate sample sizes; see, for example, Evans & Swartz (2000, Chapter 
4) or Tierney & Kadane (1986) for more details and DiCiccio, Kass, Raftery & Wasserman 
(1997) or Nott, Kohn & Fielding (2008) for further work in the context of approximating the 
normalization constant. The recent article by Rue, Martino & Chopin (2009) successfully 
uses Laplace approximations for latent variables in latent Gaussian models to approximate 
marginal densities, while Haran & Tierney (2010) use the Laplace approximation to build 
automatic MCMC algorithms in a related model class. 

The Laplace approximation is always unimodal and elliptical. In the case of multiple modes, 
one can partially overcome this problem by fitting Laplace approximations to each mode (see 
Gelman, Carlin, Stern & Rubin (2003, Chapter 12)). Nevertheless, non-elliptical skew pos- 
terior distributions remain a challenge and relatively few papers try to improve the Laplace 
approximation in this regard. One approach is to employ third order derivatives in the Tay- 
lor expansion (see for example O'Hagan & Forster (2004, p. 238-239)). Unfortunately, these 
can be hard to calculate, particularly in larger dimensions. Due to dominance of the cubic 
terms, the approximation might also diverge in the tails, so that it might not be integrable. 
Another approach is to center the normal approximation on the first two moments, instead 
of mode and negative inverse Hessian at the mode (Minka 2001). Nott, Fielding & Leonte 
(2009) consider to improve the Laplace approximation based on numerical integration, in 
particular for approximating Z. 
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In this paper the idea of iterating the Laplace approximation is developed. Given an approx- 
imation 7r{x) of 7r(a;), the approximation of 7r{x) is improved by fitting a Laplace approxima- 
tion to the residual r{x) = 7r{x) — 7t{x). The current approximation Tf{x) fits worst where 
r{x) has its maximum, so that the resulting residual Laplace approximation corrects the 
deficiencies of fit between 7i{x) and 7i{x). The new approximation is then given by a linear 
combination of the starting approximation 7r{x) and the residual Laplace approximation. 
This process is repeated until the approximation does not change considerably. The idea 
of improving an approximation by fitting the "residuals" of the current approximation and 
then using a linear combination of the obtained functions, has been applied elsewhere in the 
statistical literature. It is, for example, at the core of the relaxed greedy algorithm described 
in Barron, Cohen, Dahmen & DeVore (2008) (who use dictionaries to model the individual 
functions) or the boosting technique in machine learning, where this idea is usually called 
functional gradient descent, see Biihlmann & Yu (2003). The difference to the procedure 
presented here is that the Laplace approximation is used to define new elements in the linear 
combination, rather than relying on a functional basis or dictionaries. Using Laplace approx- 
imation has the advantage that it is automatically centered at the point, where the current 
approximation is worst, i.e., where the residual is largest. In addition, the approximation is 
a linear combination of multivariate normal densities. This is convenient from a statistical 
and computational viewpoint {e.g., sampling random variates and evaluation of the density 
are straightforward). 

At the end of the iterative process one hence obtains a global approximation of n^x), which 
can be used for subsequent statistical inference directly, or as a proposal distribution for 
Monte Carlo simulation (such as importance sampling or the independence Metropolis- 
Hastings algorithm (Tierney 1994, ch. 2.3)). Building global approximations to obtain 
proposal distributions has received considerable attention in recent years. An idea appear- 
ing repeatedly is that of building a mixture based approximation, while performing the 
iterative simulation algorithm. In the context of importance sampling this has been consid- 
ered for example by Cappe, Douc, Guillin, Marin & Robert (2008) or Ardia, Hoogerheide 
& van Dijk (2009). In the context of adaptive MCMC, mixture based approximations have 
been discussed for example by Andrieu & Moulines (2006, ch. 7) or Giordani & Kohn (2010). 

We think that iterating Laplace approximations can be a viable alternative to current ap- 
proaches for building global approximations of 7t{x)/Z. In Section 2 this idea will be elabo- 
rated in detail, while in Section 3 the methodology will be evaluated on three test examples 
and a nonlinear regression application. 
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Figure 1: Illustrating the iterated Laplace approximation. 

2 Iterated Laplace Approximations 

The main idea of iterated Laplace approximations, abbreviated iterLap, is to improve the 
current approximation tt{x) of 7r(a3) by performing a Laplace approximation of the residual 
r{x) = 7t{x) — tt{x), and then taking Wi7f{x) + W2rL{x) as the new approximation, where 
tl^x) is the (normalized) Laplace approximation of r{x) {i.e., a multivariate normal density) 
and the Wj are suitably determined positive coefficients. This procedure is then iterated 
until a satisfying approximation is achieved. The final approximation of 7t{x) will be a 
linear combination of multivariate normal densities and the normalizing constant of the 
approximation can be obtained simply by adding the coefficients Wj. The residual r{x) can 
get negative, which causes a problem when calculating log(r(a;)). Hence one should use, for 
example, the positive function r{x)lA{x) 4- exp(r(a3) — e)elyic(a;) as the objective function, 
where A = {x\r{x) > e} and e is a small positive constant. This does not change the residual 
in the region of interest, where r{x) > 0. 

The procedure will be illustrated here by fitting the one-dimensional skew non- normalized 
density 7i{x) = 0, 1) + O.50(x, —3, 2), where a) denotes the density of the normal 

distribution with mean /i and standard deviation a. In Figure 1 (i) the true function 7r{x) 
and the Laplace approximation 7f(x) based on the global maximum are displayed. Due to 
skewness, the Laplace approximation fits poorly in the left part of vr(x), where the residual 
function r{x) consequently has its maximum. Figure 1 (ii) displays the new approximation 
based on a mixture of the initial Laplace approximation and the Laplace approximation 
of the residual r{x). The coefficients Wi and W2 of the two components were chosen to 
minimize the L2 distance between approximation and truth 7r{x). The approximation fits 
the true function fairly well, despite a remaining small discrepancy in the left tail, which 
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could be eliminated by a further iteration of the algorithm. 

While the conceptual idea of iterating Laplace approximations is fairly simple, care must be 
taken when implementing the idea. In the following the computational details of iterated 
Laplace approximations will be described. 

Algorithm 

Iteration 0: 

1. Fit a Laplace approximation to each mode of 7r(a;), to obtain a starting approximation: 
7ro(a;) = ^^.^-^ /x^-, Sj), where 0(a;,/i, S) denotes the density of a multivariate 
normal distribution with mean and covariance matrix S, J*^") the number of found 
modes, /x^ the modes, S-,- the negative inverse Hessians of log(7r(a;)) evaluated at the 
modes and Wj = (27r)P/^|Sj|^/^7r(/Xj); see Gelman et al. (2003, Chapter 12) for details 
on this multiple mode Laplace approximation. 

2. Determine for each component in the linear combination a grid of size n that encloses 
most of its probability mass (see A.l below for more details). Let Xq denote the 
nJ*-"-* X p matrix that contains these grid points in the rows. 

3. Evaluate 7r(.) at Xq, resulting in the vector t/q of length nJ^^\ Also evaluate each of 
the J'^^^ component densities in the mixture at Xq and write those evaluations in the 
nJ(°) X matrix Fq. 

4. Check the stopping criterion (see A. 2 below for details); if this is not met initialize 
t ^ 1. 

Iteration t: 

1. In this step the residual Laplace approximation is performed to obtain one new mix- 
ture component. Select k possible starting values for optimization of the log-residual 
log(r(a;)), with r{x) = 7r{x) — nt^x) (see A. 3 below for details on selecting starting 
values). Start a local optimizer at the first starting value, potentially resulting in a 
maximum x with zero gradient vector. If the Hesse matrix i3" at i is negative definite, 
a new mixture component has been found, i.e., one increments the number of compo- 
nents J*-*-* ^ J^^~^^ + 1, and sets /Xj{t) = x and = —H ^ . Otherwise, if H is not 
negative definite one tries the next starting value. If all of the k starting values fail, 
stop the procedure, as no adequate improvement can be found; otherwise continue to 
step 2. 
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2. Determine a grid Nt of size n for the new component that encloses most of its proba- 
bihty mass (see A.l). Add these points to the current grid Xt^i to form Xt = ( '^jif^^ ) , 
which is then of size n J^*-* x p. 

3. Evaluate 7r(.) at Nt and append these evaluations to yt_i to form Evaluate all 
components of the approximation nt-i{.) at Nt and the new component at the entire 
grid Xt to form the nJ^*-* x J^*) matrix Ft- 

4. Find the coefficients wi, . . . , Wj{t) by minimizing {yt — F[wy{yt — F[w) subject to Wj > 
for j = 1, . . . , J^*^ (see A. 4 below for details) and calculate the current approximation 
of the normalization constant Zt = Xljli^i- The current approximation of tt(x) is 
then 7rt{x) = Y.j=i Wj(f){x, fij, T,j). 

5. Check the stopping criteria, if they are not met iterate t <(— t + 1 (see also A. 2). 



The points below illustrate computational details in the implementation of the algorithm. 



A.l Grid 

The reason for choosing the grid is to identify regions where 7r(.) has positive probability 
mass. In our experience it works well to use a quasi-random sample of the multivariate 
normal distribution underlying the selected component. For this purpose a randomized 
quasi-random sample generated by the Sobol sequence is used (as implemented for 
example in the R package randtoolbox, Dutang (2009)). Compared to a pseudo 
random sample this has the advantage that the space is more systematically covered. 
A default choice of n is discussed below. 

A. 2 Stopping Criteria 

Different criteria can be used for stopping the iterative process. First one can compare 
ijt and ift = F[wt, i.e. stopping the iterative process, when max|^/j — < 6Mt, where 
Mt = max7r(a;) and 5 is a small positive number. This criterion assesses the quality 
of the approximation on the current grid, and stops the process, when there are only 
small differences between truth and approximation. This criterion is already available 
at iteration 0. Another stopping rule is to monitor the normalizing constant Zt of 7tt{x). 
It measures the "volume" of the approximation ntix); if Zt does not change further 
this indicates that the algorithm cannot find more regions, where 7r(a;) has relevant 
probability mass. It is not uncommon that two consecutive iterations only lead to 
small changes in the normalizing constant, so we stop the iterative process, when Zt 
does not change considerably the third time in a row, i.e., when < 
where e is a small positive constant. Note that both of the above stopping criteria (as 
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most stopping criteria for iterative calculations), do not guarantee a certain quality of 
the solution, but with sensible values for e and 6 (default choices are discussed below) 
one often obtains a satisfactory result. Note that the procedure can also stop, when no 
adequate modes can be found in step 1 of iteration t, or when a maximal pre-specified 
number of components T is reached. All of the above criteria will be employed in the 
examples discussed in the later sections. 

A. 3 Starting values 

The residual function r{x) is often multimodal. Because the used optimizers are de- 
signed for finding local optima, it is crucial to use good starting values. In our expe- 
rience it works well to use starting values where yi/fji, i = 1, . . . ,nJ^*^ is largest; here 
Hi and jji denote the entries of and y^. This choice works better than, for example, 
using the values where i/i — iji is largest, because the former selects values further apart 
from the current modes. In our implementation the ten grid values for which the yi/iji 
is largest are selected, and then clustered with the /c-means algorithm (Hartigan & 
Wong 1979) to obtain k = 3 starting values. The order in which the starting values 
are used is determined by the distance to the last added mode, with the values far- 
thest away being tried first. This prevents the algorithm from wandering in only one 
direction. 

A. 4 Quadratic Programming 

Solving the constrained least squares problem in step 4 is a quadratic programming 
problem and can be solved efficiently for example with the algorithm of Goldfarb & 
Idnani (1982). 

In summary the algorithm needs: The grid size n for each component, the values e and 6 
for the two stopping criteria and the maximum number T of components allowed. Suited 
default values for those parameters have been determined by experimentation based on a 
number of example densities, covering a range of different cases observed in practice. A 
good time-quality trade off for the grid size was obtained for the smallest integer larger than 
rjj^g number 50 has been selected based on the observation that a grid size of 50 is 
often sufficient in one dimensional cases, the exponent 1.25 has been chosen to obtain a grid 
size that grows slightly faster than linear in the dimension. For 6 and e the choices 6 = 0.01 
and e = 0.005 were found to work well. The main rationale for these selections is that one 
should not stop the process before all relevant probability mass has been identified. On the 
other hand, stopping late will take more time (both for construction of the approximation, 
as well as for sampling) with only a marginal improvement. For the maximum number of 
components, T = 20 was sufficient in the considered examples. All subsequent applications 
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Figure 2: Contour plot for the three test densities. For /a only the two first coordinates are 
plotted. Black dots (•) denote local maxima and the contour lines are located at 0.9, 0.5, 
0.25, 0.05 and 0.01 of the height of the global maximum. 

in this paper rely on these default assumptions. When these choices fail, i.e., the algorithm 
fails to identify parts of the probability mass, one can increase n, or decrease 6 and e. Another 
strategy is to increase the number of starting values at iteration 0, if complete modes might 
have been missed at the beginning. 

The algorithm was implemented in the R computing language (R Development Core Team 
2010). To avoid fioating point errors we work on the log-scale whenever possible. When 
one needs to work on the original scale, for example, in the quadratic programming step 
or in the optimization of log(r(a;)), one can use 7r*{x) = exp(log(7r(a?)) — log(Mf)), where 
log(Mf) = maxlog(7r(a3)). In this way excessively small values are avoided. 



3 Applications 



3.1 Test Cases 

In this section the method is illustrated for three artificial, yet realistic test cases. Com- 
mon challenges for computational approaches in Bayesian statistics are skew, non-linear and 
multimodal posterior distributions. These often occur in applications beyond the standard 
statistical models, for example, when the statistical model is not from an exponential family 
with conjugate prior and linear predictors. 
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The first density /i is a bivariate skew t— distribution with 5 degrees of freedom, scale matrix 
^ = (-0 9~i'^)) location vector ^ = (0,0)' and skewness vector cx = (0,15)' (see Azzalini 
& Capitanio (2003) for details on the parametrization) . This density (displayed in Figure 2 
left) possesses an extreme non-elliptical skew shape. The second density /2 is a mixture of 
three bivariate normal distributions: f2{x) = O.340(a;, (0, 0)', Si) +0.33(j){x, (—3, —3)', >S'2) + 
0.330(3^,(2,2)', 53), with 5i = (1?), 52 = (o^gV) and = (^^.gT) (Gilks, Roberts 
& Sahu 1998). This density is multimodal and has a complex local structure, see Figure 2 
(middle). The third density is the ten dimensional non-linear banana shaped distribution 
used in Wraith et al. (2009), with density fsi^x) oc (f)(t{x),m, S), where t{x) = {xi,X2 + 
h{x\ — ), xs, . . . , xio), m = (0, . . . , 0)' and S = dia.g{af, 1, . . . , 1), 6 = 0.03 and af = 100 
were used, as in Wraith et al. (2009). Figure 2 right, displays the density of the first two 
components, when the other coordinates are fixed at 0. 

iterLap has been applied to these problems with the default tuning parameters. As starting 
values for the optimization in the initial Laplace approximation in each case one vector 
consisting only of zeros has been used. For fi iterLap selects 9 components and stops the 
iterative process, because the algorithm cannot further change the approximation of the 
normalization constant. For /2 the algorithm stops the iterative process after 5 components, 
because the maximum error on the grid points is achieved. For the iterative process is 
illustrated in some detail in Figure 3, which displays the selected 11 components and the 
order in which they are selected. The algorithm stops, because the normalization constant 
of the approximation does not further change considerably. In Figure 3 one can also see that 
the orientation of the selected components' ellipses fits the underlying local structure of the 
distribution quite well. 

As distance measure to the true density the normalized effective sample size (NESS) has 
been calculated by an application of importance sampling with the obtained mixture of 
normal distributions as proposal. NESS lies in (0,1], where larger values correspond to 
a better fit and NESS is an estimate of a monotonic transformation of the x^"distance 
between proposal and true distribution (Kong, Liu & Wong 1994). It is defined as NESS = 
1 / (A^ Xlili ^i)^ where A^ is the number of simulated values and a)j the normalized importance 
weights. The reported values of NESS in Table 1 are the average over 100 independent runs 
of importance sampling each with sample size A^ = 10000 (standard deviation given in 
brackets). In addition the means and standard deviations of the marginal distributions have 
been compared. For this purpose the absolute distance between true and approximated 
mean and standard deviation were used and divided by the true standard deviation of the 
corresponding marginal distribution. Deviations from the true values are thus measured in 
units of the true standard deviation. 
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Figure 3: Illustration of iterLap for the first two dimensions of /a. The numbers in the dots 
refer to the order in which components were added. 



Density 


Method 


NESS 


MeaUa;^ 


Sda;, 


Meaujjj 


sdx2 


/i 


Laplace 
iter. Lapl. 


0.04 (0.04) 
0.65 (0.20) 


0.72 
0.02 


0.49 
0.16 


0.88 
0.05 


0.59 
0.11 


/2 


Laplace 
iter. Lapl. 


0.02 (0.01) 
0.99 (<0.01) 


0.13 
<0.01 


0.56 
<0.01 


0.13 
<0.01 


0.56 
<0.01 


/3 


Laplace 
iter. Lapl. 


0.05 (0.04) 
0.71 (0.04) 


<0.01 
<0.01 


<0.01 
0.14 


0.70 
0.15 


0.77 
0.08 



Table 1: Normalized Effective Sample Size (NESS) and approximation error in the marginal 
mean and standard deviation relative to the true standard deviation for the Laplace approx- 
imation and the iterated Laplace approximation of /i, /2 and /s. 
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Table 1 displays the results for iterLap and for a standard Laplace approximation (which 
was used as the starting approximation for iterLap). One can conclude that the iterated 
Laplace approximation has a substantially better performance than the standard Laplace 
approximation particularly in terms of NESS, but also for the moments of the marginal 
distributions. 

The results regarding the ten dimensional /s are also quite encouraging: Compared to the 
results of Wraith et al. (2009, Figure 3), who use adaptive importance sampling based on 
Cappe et al. (2008), one can observe that a similar median effective sample size was obtained 
for 10 iterations of adaptive importance sampling. The adaptive importance sampling, how- 
ever, has a larger variability and needs a total of 100000 function evaluations in each case. 
Our approach needs a total of around 19000 function evaluations for building the approx- 
imation, including evaluations for building the grid as well as evaluations needed for the 
optimizer and calculating the Hessians. The number of function evaluations is an important 
machine independent indicator on how fast an algorithm runs, as the other computations 
needed by the algorithms can usually be neglected, particularly if evaluation of the target 
distribution is computationally expensive. To get an idea of the actual computation time 
needed by our R implementation (which does not exploit that parts of the code can be paral- 
lelized): Building the global approximation for takes around 2.5 seconds (using a Laptop 
computer with 1.86 Ghz and 2GB RAM). 



3.2 Nonlinear Regression 

To illustrate iterLap on a real problem, data on monthly averaged atmospheric pressure differ- 
ences between the Easter Islands and Darwin, Australia over 168 months are used (see Figure 
4). These data are taken from the NIST website www.itLnist.gov/div898/strd/nls/data/enso.shtmL 
The difference in pressure is of meteorologic importance as it drives the trade winds in 
the southern hemisphere, and the main purpose of the data analysis is to infer the fre- 
quency of periodic cycles. The model for the data is i/i ~ N{fi{i),(T'^), where /i(i) = 
a + sin(27ri/Afc) + -B^ cos (2 m/ A a,.) for i = 1, . . . , 168. For the conditionally linear 

parameters a,Ak,Bk, k = 1,2,3 independent Cauchy a-priori distributions with median 
and scale 100 (for a) and 10 (for Ak, Bk, k = 1,2,3) were used. For the positive parameter 
Afc independent uniform distributions on [0, 100] were employed and for a a gamma distri- 
bution with parameters 0.1 and 0.1. The transformation log(cr) has been used, to obtain a 
parameter that lies in M. All parameters will be summarized in the vector 0. The likelihood 
surface is highly multimodal, but there seems to be one dominant mode. As starting value 
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Month 

Figure 4: Pressure Difference between the Easter Islands and Darwin, Australia with smooth- 
ing spline fit (black line) to give an idea of the conditional mean function. 

for the optimization the least squares estimate shown on the NIST website is used (note 
however that here also Ai is treated as unknown, while in their analysis this is fixed to 12). 

To compare iterLap to other computational methods, four long MCMC runs were produced 
(each started at the mode and of size 2500000 with thinning 10 after a burn-in of 10000) and 
the resulting 1000000 iterations are used as a gold-standard. These runs were obtained with 
a multivariate random walk Metropolis algorithm implemented in the mcmc package of Geyer 
(2010), with proposal variance matrix 2.38^/llS, where S is the negative inverse Hessian 
at the mode of the log-density. 

Using the default tuning parameters, iterated Laplace approximation selects 12 components 
and stops the iterative process, because the normalization constant of the approximation does 
not further improve. For this application the obtained distribution will be used as a proposal 
distribution for importance sampling. As is common in importance sampling, the Gaussian 
approximations obtained from iterLap will be replaced by mixtures of t-distributions with 
equal centering vector and scale matrix and 10 degrees of freedom, because those possess 
heavier tails. To obtain an unweighted sample, importance sampling resampling with residual 
resampling was used with sample size 5000 (see Robert & Casella (2004, Chapter 14) for 
details on residual resampling). 

The iterLap procedure will be compared with two MCMC-based approaches. First the com- 
ponentwise adaptive random walk Metropolis algorithm is used (see Roberts & Rosenthal 
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Figure 5: Simulation results for the three methods over 100 repetitions. iLap = importance 
sampling with iterLap, aMwG. = adaptive Metropolis within Gibbs, oRWM = optimal random 
walk Metropolis with true covariance matrix. 

(2009)). To improve the componentwise updating, the algorithm is applied on the trans- 

~ 1/2 _ _ ~ 1/2 

formed scale 6 = Y, + /x, with /x the posterior mode and I] the square root of the 

negative inverse Hessian at the mode calculated from the eigen decomposition. This has the 
advantage that the covariance matrix is approximately diagonal on the transformed scale. For 
implementation the function adaptMetropGibbs from the spBayes package (Finley, Baner- 
jee & Carlin 2010) was used with target acceptance rate set to 0.44 and initial proposal 
standard deviations set to 2.38. The second MCMC approach is an optimally tuned mul- 
tivariate random walk Metropolis algorithm. The covariance matrix of the proposal was 
chosen proportional to the covariance matrix obtained from the gold-standard runs above, 
with the scaling selected as 2.38^/11. This is an algorithm that cannot be used in practice, 
because the "true" covariance matrix is unavailable a-priori. Nevertheless, this algorithm is 
of interest as it provides an upper bound on the performance of the adaptive random walk 
Metropolis algorithm (as described for example in Roberts & Rosenthal (2009)), which needs 
to estimate the covariance matrix from its own iterations. In all cases the starting value is 
chosen equal to the mode of the posterior distribution. 

Four performance measures are used to compare the methodologies. First, the estimates of 
the posterior means are compared by calculating the Mahalanobis distance M{6, Otrue) = 

\J {6 — Otrue)' S~[j\i,e(^ ~ ^true), betwecu the empirical average from the simulation output 
and the values Otme and Stme obtained from the gold-standard runs. To compare the es- 
timates of the parameter covariance matrix, the spectral norm of the difference between 
the empirical and true covariance matrices C{S, Stme) = \/ ^max{D' D) was calculated, 
where D = S — Stme and Xmax{-) is the function that returns the largest eigenvalue. To 
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obtain measures of multivariate goodness of fit, the quantiles g^os' • • • 5 Qo.L along 

the first two eigenvectors Cj, j = 1,2 of the true correlation matrix were calculated from 
the gold-standard runs and then compared to the quantiles obtained from the simulations 
Qiej) = ^, E |#-g?|,forj = l,2. 

te{0.05,0.1,...,0.95} 

All three algorithms were applied to the problem and repeated 100 times. The iterLap pro- 
cedure requires around 28000 function evaluations values in total (both for building the 
approximation and the 5000 evaluations for importance sampling). For the MCMC based 
algorithms twice as many function evaluations namely a total of 60000 iterations were al- 
lowed, with 10000 burn-in and thinning rate 10, so that in summary also 5000 iterations are 
obtained. From the results displayed in Figure 5 one can conclude that importance sampling 
with iterLap works very well compared to the other approaches in all performance measures 
(particularly for the posterior moments), although it uses fewer function evaluations. In 
addition it is easy to obtain a reliable estimate of the normalizing constant via importance 
sampling, while this is more complicated to obtain reliably from MCMC output. 

Returning to the original aim of the data analysis, the resulting posterior means for the 
frequency parameters (Ai, A2, A3) are given by (11.9,44.1, 26.8)' months with standard devi- 
ations (0.04,1.1,0.36)'. These frequencies can be traced to the yearly cycle (Ai), El Nino 
(A2) and the Southern Oscillation (A3), see the NIST website for details. 



4 Discussion 



Compared to traditional function approximation or regression function estimation, globally 
approximating a positive integrable function tt^x) proportional to a probability density is 
a considerably more difficult problem. The main complication is that it is a-priori unclear 
where to approximate '/r(a;), i.e., where most of the probability mass of 7r{x) is located. In 
this article the iterated Laplace approximation has been introduced to solve this twofold 
problem of identification of regions with relevant probability mass, and approximation of 
it{x) in these regions. The methodology starts with a simple Laplace approximation, and 
then iteratively applies Laplace approximation to the residual between truth and current 
approximation, until a stopping criterion is satisfied. By optimizing the residual in each 
step of the procedure, the algorithm identifies regions with relevant probability mass, where 
the current approximation fits poorly and an improvement is needed. Once a mode and the 
local curvature is determined, the new component is added to the approximation, and the 
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coefficients are determined by minimizing the L2 distance between truth and approximation 
on a grid. 

In this paper the methodology has been evaluated in three test cases and one real exam- 
ple with positive results. In the case of the ten dimensional banana shaped example, the 
approach obtained similar results as adaptive importance sampling based on a mixture of t 
distributions with improved computational efficiency in terms of function evaluations. Fur- 
ther, in the real data example, where two state of the art MCMC algorithms were applied, 
the iterated Laplace approximation showed a very competitive performance with a smaller 
number of function evaluations. 

As for all analytical approximations, it is difficult to assess the quality of the obtained iterated 
Laplace approximation in a concrete modelling situation. Hence, its main value is to use 
it as a proposal distribution for Monte Carlo techniques. These techniques allow to assess 
the quality of the approximation and correct for deficiencies of fit by rejecting or weighting 
samples. While the focus in this paper has been on importance sampling techniques (where 
the effective sample size can be used to assess the quality of the approximation), one can, 
of course, also use the approximation in the context of MCMC techniques that employ a 
global proposal distribution, such as the independence Metropolis-Hastings algorithm or the 
rejection sampling Metropolis-Hastings algorithm (Tierney 1994, ch. 2.3), in these cases 
reliable MCMC standard errors can be calculated to evaluate the quality of the simulation 
(see Flegal, Haran & Jones (2008)). 

A primary application of the methodology might be non-linear models as applied in diverse 
fields, for example early phase clinical trials or cosmology. In these models the posterior 
distribution can be skew and multimodal, and one usually cannot design Gibbs moves to 
directly sample from the full conditional distributions. Nevertheless, the algorithm was also 
tested with success on a variety of other applications, for example dose-response estimation, 
Gaussian process regression and simple hierarchical models. 

In higher dimensional problems it gets difficult to build a global approximation of non-trivial 
posterior distributions, and the proposed methodology is no exception: In these cases local 
MCMC moves often become more efficient, although the iterated Laplace approximation 
typically still provides an improvement over the standard Laplace approximation in terms of 
building a global approximation of the posterior and approximating the normalization con- 
stant. A computational concern with regard to the methodology is the need for numerically 
calculating the Hessian matrices. Depending on the problem, this might become unstable 
(for example when the objective function is flat in the neighborhood of the mode, or the 
mode lies on a ridge) and, in larger dimensions, computationally expensive. A partial so- 
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lution, as suggested by a referee, is to use a structured form of the covariance matrix {e.g., 
a diagonal matrix). This reduces the computational burden in high dimensional cases and 
stabilizes computations. The downside would be that a worse fit is obtained by the added 
components and it is likely that more mixture components are required to obtain an ade- 
quate approximation. Another challenge for the iterLap methodology are situations when the 
target density contains a large number of strongly separated modes. A partial solution in 
these cases is to use more widely dispersed starting values for the starting approximation (at 
iteration 0). Alternatively, one could also consider to use tempered version of the residual 
function to avoid getting trapped in minor local modes. 
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